Nearest neighbor clustering determination and estimation algorithm that hashes centroids into buckets and redistributes vectors between clusters

ABSTRACT

Embodiments are presented for determining and/or estimating a nearest neighbor to a data vector in a dataset. Some embodiments redistribute data vectors between clusters based upon the character of the clusters to more evenly balance the computational load. Some embodiments employ Locality Sensitive Hashing (LSH) functions as part of the clustering and remove redundant data vectors from the data set to mitigate unbalanced computation. The disclosed embodiments may facilitate the analysis of very large and/or very high dimensional datasets with reasonable runtimes.

TECHNICAL FIELD

The disclosed embodiments relate to data analysis, particularly processing distribution and efficiency, e.g., for identifying nearest neighbors in a dataset.

BACKGROUND

Large datasets are increasingly readily available from a wide variety of different domains. For example, websites regularly acquire very large datasets of consumer purchasing information, weather monitoring systems regularly acquire a wide variety of measurement data, security installations acquire a wide variety of image data, etc. Sifting through these copious datasets can be onerous and costly, but the rewards for identifying meaningful patterns and trends can be great.

Machine learning and data analysis tools offer one possible method for automating the analysis process. These tools can classify data and determine associations between different datasets. However, when the datasets become exceptionally large these tools may be unable to determine meaningful relationships in a reasonable amount of time. For example, in an application comparing images, e.g., to recognize common objects in each image, each additional image increases the imposed load, as the Nth image must be compared with each of the preceding N−1 images; the total number of comparisons thus grows quadratically with the size of the dataset. For even a small 64×64 grayscale pixel image, this can be a considerable constraint (64×64=4096 dimensions per image).

Accordingly, there exists a need to more accurately and to more quickly determine associations among data points in large datasets. In particular, in many situations it would be ideal to determine or to estimate a nearest neighbor to a target data vector in a dataset quickly and efficiently.

BRIEF DESCRIPTION OF THE DRAWINGS

The techniques introduced here may be better understood by referring to the following Detailed Description in conjunction with the accompanying drawings, in which like reference numerals indicate identical or functionally similar elements:

FIG. 1 is an example plot of data vectors in a two dimensional space as may be provided for a nearest neighbors search in some embodiments.

FIG. 2 is a flow diagram depicting a K-Means routine as may be implemented using various of the disclosed embodiments.

FIG. 3 is a graphical depiction of an example iteration of the K-Means routine depicted in FIG. 2.

FIG. 4 is a flow diagram depicting a high-level overview of a LSH indexing routine as may be used in some embodiments.

FIG. 5 is a flow diagram depicting a high-level overview of the LSH search routine as may be used in some embodiments.

FIG. 6 is a graphical depiction of an example application of the LSH indexing and searching operations to a dataset as implemented in some embodiments.

FIG. 7 is a flow diagram depicting a high-level overview of the LSH operations applied to K-means as may be used in some embodiments.

FIG. 8 is an example block diagram of the processing distribution of the nearest neighbors determination as may occur in some embodiments.

FIG. 9 is an example redistribution of clustered points via a distance-based reassignment as may be used in some embodiments.

FIG. 10 is an example redistribution of clustered points via a random reassignment as may be used in some embodiments.

FIG. 11 is a flow diagram depicting a nearest neighbor determination as may occur in some embodiments.

FIG. 12 is an example plot depicting certain clustering statistics as may be used in some embodiments.

FIG. 13 is a block diagram of a re-clustering module as may be implemented in some embodiments.

FIG. 14 is a flow diagram depicting a re-clustering operation as may occur in some embodiments.

FIG. 15 is a block diagram depicting redundant vector behavior in LSH as may occur in some embodiments.

FIG. 16 is a flow diagram for a process for identifying and optionally removing redundant vectors as may occur in some embodiments.

FIG. 17 is a block diagram of a computer system as may be used to implement features of some of the embodiments.

The headings provided herein are for convenience only and do not necessarily affect the scope or meaning of the claimed embodiments. Further, the drawings have not necessarily been drawn to scale. For example, the dimensions of some of the elements in the figures may be expanded or reduced to help improve the understanding of the embodiments. Similarly, some components and/or operations may be separated into different blocks or combined into a single block for the purposes of discussion of some of the embodiments. Moreover, while the various embodiments are amenable to various modifications and alternative forms, specific embodiments have been shown by way of example in the drawings and are described in detail below. The intention, however, is not to limit the particular embodiments described. On the contrary, the embodiments are intended to cover all modifications, equivalents, and alternatives falling within the scope of the disclosed embodiments as defined by the appended claims.

DETAILED DESCRIPTION

Overview

Various of the disclosed embodiments relate to determining data vectors resembling other data vectors in a dataset. For example, in some embodiments a user may desire to identify the “nearest neighbor” to a data vector, e.g., the data vector sharing the most similar values of all the data vectors in the dataset. For very large datasets it may be impractical to determine the nearest neighbor by brute force. Instead, as occurs in some embodiments, the dataset may be grouped into similar data vectors and then searches performed on these groups by a collection of distributed computing systems. In this manner, the computational load may be divided and the result may be more readily achieved.

The grouping of the data vectors may be accomplished using various clustering methods, e.g., K-means, as discussed in greater detail herein. K-means may itself be optimized using Locality Sensitive Hashing (LSH) techniques, also summarized below. Various embodiments adjust and optimize these procedures to improve their use in identifying nearest neighbors. However, in some cases the disclosed improvements extend beyond the direct application of these methods to nearest neighbors and may be used in general categorization routines.

FIG. 1 is an example plot 100 of data vectors in a two dimensional space as may be provided for a nearest neighbors search in some embodiments. In this example, the data vectors are depicted as black circles distributed across the two dimensions of the page (though one will recognize that actual datasets may comprise hundreds or thousands of dimensions). Given a target data vector 110 a, a user may desire to identify a nearest neighbor, e.g., data vector 110 b. Although one could identify the nearest neighbor by a brute force comparison of all the data vectors, this is generally computationally “expensive.” A clustering routine, e.g., K-means, may be used to cluster groups 105 a-c of data points and an analysis then performed only among data vectors in the same cluster as the target data vector, e.g., cluster 105 c. While more efficient, this process may introduce errors. For example, only a single data vector is present in cluster 105 b, and so the system will not be able to identify the nearest neighbor for that data vector in this approach. Optimizing the clustering of the data vectors may produce results both more efficiently and more accurately.

General Description

Various examples of the disclosed techniques will now be described in further detail. The following description provides specific details for a thorough understanding and enabling description of these examples. One skilled in the relevant art will understand, however, that the techniques discussed herein may be practiced without many of these details. Likewise, one skilled in the relevant art will also understand that the techniques can include many other obvious features not described in detail herein. Additionally, some well-known structures or functions may not be shown or described in detail below, so as to avoid unnecessarily obscuring the relevant description.

The terminology used below is to be interpreted in its broadest reasonable manner, even though it is being used in conjunction with a detailed description of certain specific examples of the embodiments. Indeed, certain terms may even be emphasized below; however, any terminology intended to be interpreted in any restricted manner will be overtly and specifically defined as such in this section.

K-Means Operations

FIG. 2 is a flow diagram depicting a K-Means routine 200 as may be implemented using various of the disclosed embodiments. The K-Means routine may be used to identify groups of “similar” data, e.g., data having many near data points. The system may begin with a dataset comprising multiple data vectors. The data vectors may be high dimensional data. As one example, the dataset may be a collection of 64×64 grayscale pixel images, with each vector representing an image and constituting 4096 values. A wide variety of datasets other than images may be used (e.g., word counts, time values, etc.).

Given the data, the routine 200 may generate K initial points for classification, e.g., randomly, at block 205. In subsequent iterations these points may be replaced with centroids of the respective clusters. Though referred to as a “centroid,” one will recognize the terms “average,” “mean,” and/or “Nth-moment” as describing a similar concept. The number K of centroids may be determined based on a number of factors, e.g., the desired categorization, the nature of the data, etc.

At block 210, the routine 200 may determine K new clusters of data byassociating each data vector with a nearest centroid.

At decision block 215, the routine 200 may determine if an appropriate end condition has been reached. For example, the routine may stop after a specified number of iterations, or when each successive centroid is less than a specified threshold distance from its predecessor in the previous iteration.

If the end condition has not been reached at block 215, the routine 200 may proceed to block 220 where K new centroids are determined based upon the clusters. For example, where four data vectors are associated with a cluster, the preceding centroid may be removed and the average of the data vectors' values may determine the values of a new centroid.
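Purely for illustration, blocks 205-220 may be sketched in Python with NumPy as follows. The function name, the random selection of initial points from the dataset itself, and the movement-based end condition are illustrative assumptions rather than required implementation details.

    import numpy as np

    def kmeans(data, k, max_iters=100, tol=1e-4, seed=None):
        """Minimal K-means sketch following routine 200."""
        rng = np.random.default_rng(seed)
        # Block 205: generate K initial points, here drawn from the data.
        centroids = data[rng.choice(len(data), size=k, replace=False)]
        labels = np.zeros(len(data), dtype=int)
        for _ in range(max_iters):
            # Block 210: associate each data vector with its nearest centroid.
            dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Block 220: replace each centroid with the mean of its cluster.
            new_centroids = np.array([
                data[labels == j].mean(axis=0) if (labels == j).any() else centroids[j]
                for j in range(k)
            ])
            # Block 215: end once every centroid moves less than a threshold.
            moved = np.linalg.norm(new_centroids - centroids, axis=1).max()
            centroids = new_centroids
            if moved < tol:
                break
        return centroids, labels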

While the flow and sequence diagrams presented herein show an organization designed to make them more comprehensible by a human reader, those skilled in the art will appreciate that actual data structures used to store this information may differ from what is shown, in that they, for example, may be organized in a different manner; may contain more or less information than shown; may be compressed and/or encrypted; etc.

FIG. 3 is a graphical depiction of an example iteration of the K-Means routine depicted in FIG. 2. In the 2-dimensional example of FIG. 3, at state 315 a, a collection of data vectors (represented by solid black circles) may be received for categorization. Initially, three centroids 305 a-c (K=3) may be randomly generated. One will recognize, as described above, that in many applications there will be many more than two dimensions of the data. State 315 a generally corresponds to the result of executing the logic of block 205 of FIG. 2.

At state 315 b, generally corresponding to block 210, bisectors 325 a-c divide the centroids 305 a-c (represented as shapes outlined with solid lines) into a Voronoi diagram. One skilled in the art will recognize a bisector 325 a-c as, e.g., a line or plane dividing two regions. A Voronoi diagram is a division of a space into regions based upon a collection of points (in this example, centroids). Each region of the Voronoi diagram represents a “clustering” of data. In state 315 b, the data vector 325 is clearly associated with the cluster generated around the centroid 305 c. In state 315 b the data vector 320 is near the bisector 325 c between two clusters.

At state 315 c, generally corresponding to block 220, new centroids 310 a-c (represented as shapes outlined with dashed lines) are calculated based upon the data vectors for each cluster. As indicated in state 315 d, the new centroids 310 a-c result in a new Voronoi clustering with bisectors 330 a-c. In this new clustering, the data vector 320 is farther from the bisector 330 c and more clearly associated with the cluster of centroid 310 c.

Application of Locality Sensitive Hashing to K-Means

While the K-means routine of FIGS. 2-3 may properly cluster the dataset, block 210 requires identifying the centroid nearest each data vector. A naïve, brute-force approach may be unsuitable when there are many centroids and many data vectors. Indeed, time and storage space (e.g., memory) requirements may grow rapidly with increases in the number or dimensions of the vectors.

Accordingly, some embodiments approximate the nearest centroid using Locality Sensitive Hashing (LSH). This approximation may be suitable in many instances and may greatly reduce the computation time and space. LSH is a method for performing probabilistic dimension reduction, and so may be useful for identifying similar data as it reduces the search space for identifying nearest neighbors. LSH groups similar data based upon multiple randomly generated projections. Determinations of nearest neighbors with LSH may be estimated, rather than exact, because LSH employs probabilistic methods discussed below.

LSH Features

FIG. 4 is a flow diagram illustrating a high-level overview of a LSH indexing routine 400 as may be used in some embodiments. Although for convenience the figures describe the operations as performed upon a “data vector,” one will recognize that centroids may be included in the analysis as well. For example, both data vectors and centroids may be placed into bins in the described manner so that centroids nearest a given data vector may be subsequently identified. A centroid may itself be a vector, though not a data vector included in the original dataset. At block 405 the system may create a new hash table for storing results. At block 410, the system may generate m random vectors with the same dimensionality as the data vectors, e.g., where each coordinate is a Gaussian random variable N(0, 1), optionally with a scalar offset term to reduce quantization bias. In some embodiments, a uniform rather than a Gaussian random variable may be used without an offset. Other variations will be readily recognized.

At block 415, the system may select the next data vector, project it upon each of the m random vectors based on a quantization factor w, and place it in the m-dimensional bin that corresponds to the m projections.

At block 420, the system may reduce the index of the m-dimensional bin into a one-dimensional index.

If not all the data vectors have been projected and hashed, then at decision block 425 the system may repeat the operations with the next data vector to be considered. If all the data vectors have been considered, then the system may proceed to decision block 430.

If at decision block 430 the desired number of hash tables (L) have been generated, the process may end. Conversely, if additional tables are to be generated the process blocks may be repeated from block 405 such that a new collection of hashes are generated for a new table. In this manner, successive categorizations may be performed across successive tables with randomized hashes to achieve a more accurate, aggregate categorization of the data vectors than would be achieved by a single table with a single collection of m randomized vectors.
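A minimal sketch of the indexing routine 400 follows, assuming Gaussian projection vectors with a uniform offset and a quantization factor w; here a Python tuple over the m quantized projections stands in for the reduction to a one-dimensional bin index at block 420. All names are illustrative.

    import numpy as np
    from collections import defaultdict

    def build_lsh_tables(vectors, m, w, L, seed=None):
        """Sketch of routine 400: build L tables of quantized projections."""
        rng = np.random.default_rng(seed)
        dim = vectors.shape[1]
        tables = []
        for _ in range(L):                            # decision block 430
            # Block 410: m Gaussian random vectors and offsets in [0, w).
            projections = rng.normal(size=(m, dim))
            offsets = rng.uniform(0, w, size=m)
            table = defaultdict(list)
            for i, v in enumerate(vectors):           # blocks 415/425
                # Block 415: project onto the m vectors and quantize by w.
                bins = np.floor((projections @ v + offsets) / w).astype(int)
                # Block 420: the tuple key serves as the reduced bin index.
                table[tuple(bins)].append(i)
            tables.append((projections, offsets, dict(table)))
        return tables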

Having performed the indexing operation of FIG. 4, embodiments may subsequently identify members of a same bin for processing. For example, given a data vector or centroid, embodiments may identify centroids/data vectors sharing the same bins as the data vector/centroid across the L tables. A nearest neighbors search may then be performed across only these similarly binned results.

FIG. 5 is a flow diagram depicting a high-level overview of the LSH search routine 500 as may be used in some embodiments. Given a “query vector” (e.g., a data vector, centroid, etc.) the search operation may identify other vectors sharing the same bin as the query vector. At block 505, the system may select the next table among the L tables generated when indexing. At block 510, the system may identify data vectors (and/or centroids) which have been placed in the same bin as the query vector. The identified data vectors (and/or centroids) not already included in the accumulated list may be added to the list of vectors for evaluation. At decision block 515, the routine may determine if all L tables have been considered. If so, the routine continues at block 520. Otherwise, the routine returns to block 505. Once vectors that share bins with the query vector have been identified across all L tables, embodiments may perform the desired operation, e.g., calculation of the nearest neighbor at block 520 using the subset of identified vectors.
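Continuing the sketch above (and reusing its table layout), the following illustrates search routine 500; the accumulation into a Python set plays the role of the deduplicated list of blocks 505-515.

    import numpy as np

    def lsh_candidates(query, tables, w):
        """Sketch of routine 500: gather vectors sharing the query's bins."""
        candidates = set()
        for projections, offsets, table in tables:    # blocks 505/515
            # Block 510: hash the query with this table's projections and
            # accumulate every vector indexed under the same bin.
            bins = np.floor((projections @ query + offsets) / w).astype(int)
            candidates.update(table.get(tuple(bins), []))
        return candidates

    # Block 520: the desired operation, e.g., the nearest neighbor, is then
    # computed over only the accumulated subset:
    # best = min(candidates, key=lambda i: np.linalg.norm(vectors[i] - query))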

FIG. 6 is a graphical depiction of an example application 600 of the LSH indexing and searching operations to a dataset as implemented in some embodiments. The dataset may comprise multiple data vectors 605 a-c. For purposes of explanation, the indexing of the data vector 605 c is depicted. Following the generation of the random vectors for projection 610 a-c (m=3), the data vector 605 c is projected 615 a-c upon each random vector 610 a-c. The random vectors 610 a-c are quantized by a factor w. In view of this quantization, the data vector 605 c is projected upon the regions A3, B3, and C3 of the respective random vectors 610 a-c. These values 620 a are hashed in Table 1 to a first Bin Index 1. The data vector 605 b is sufficiently close to the data vector 605 c that it may share the same projection values 620 b and may accordingly be hashed to the same Bin Index 1. In contrast, the data vector 605 a is sufficiently different from the other data vectors that it projects upon the regions A2, B1, and C2. These values 620 c accordingly hash to a different Bin Index 2. The generation of the m vectors 610 a-c, projections, and hashings is repeated for all L tables 625.

Once the L tables 625 have been generated, a data vector's fellow bin members may be identified by iterating 630 a-L across each of the tables 625 (e.g., using the routine of FIG. 5). These candidates may be considered together, e.g., to identify the nearest neighbor 635 to the queried data vector.

Locality Sensitive Hashing Integration with K-Means

FIG. 7 is a flow diagram depicting a high-level overview of the LSH operations applied to K-means as may be used in some embodiments. The depicted operations may occur as part of, or preceding, block 210. At block 705 the system may apply LSH to identify a subset of all possible vectors for consideration, e.g., when determining a nearest centroid to a data vector. Once the subset has been determined using LSH, at block 710 a nearest neighbor search need be performed only among the subset. For example, at block 210 the system may determine which of 10,000 centroids a target data vector is closest to (e.g., when clustering images of faces among 10,000 individuals). In this example, LSH may identify only 80 centroids as being within the same bin as the target data vector. Thus, the system need only identify the nearest centroid from this 80-centroid subset to classify the data vector. Unfortunately, LSH is an approximation and the 80 centroids may exclude the actual, closest centroid. Accordingly, as discussed in greater detail below, various embodiments address and compensate for errors produced by LSH to achieve a more accurate result. Various of the disclosed embodiments apply improvements upon LSH and K-Means as described in the co-pending application CLUSTERING USING LOCALITY-SENSITIVE HASHING WITH IMPROVED COST MODEL, filed on the same day as this filing and incorporated by reference herein in its entirety for all purposes. Various of the improvements of this incorporated application may be used with LSH as part of the K-means process, or with LSH as LSH is used independently in other applications.
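As a sketch of blocks 705 and 710 (reusing the lsh_candidates sketch above and assuming the LSH tables were built over the centroids), the nearest-centroid determination of block 210 might look as follows; the brute-force fallback for an empty candidate set is one illustrative way to compensate for LSH missing the true nearest centroid.

    import numpy as np

    def nearest_centroid_lsh(vector, centroids, tables, w):
        """Block 210 via LSH: search only centroids binned with the vector."""
        subset = lsh_candidates(vector, tables, w)    # block 705
        if not subset:
            # LSH may return no candidates; fall back to all centroids.
            subset = range(len(centroids))
        # Block 710: exact search restricted to the (small) subset.
        return min(subset, key=lambda j: np.linalg.norm(centroids[j] - vector))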

Example Nearest Neighbor Computational Distribution

FIG. 8 is an example block diagram of the processing distribution 800 of the nearest neighbors determination as may occur in some embodiments. An original dataset 820 may be provided to the system having the data vectors depicted by black circles in the figure. A primary processing system 810, or a separate system assisting the primary processing system 810, may group the data vectors into clusters 805 a-c, e.g., by using K-means or another grouping routine. The primary processing system 810 may then distribute the clusters for nearest neighbor identification. For example, the cluster 805 a may be distributed to processing system 815 a for analysis, the cluster 805 b may be distributed to processing system 815 b for analysis, and the cluster 805 c may be distributed to processing system 815 c for analysis. Processing systems 815 a-c may then report the results of their analysis, e.g., back to primary processing system 810 or output to a user. Processing systems 810, 815 a-c may be the same as or similar to the system described with respect to FIG. 17.

In some instances, however, the number of data vectors within the clusters 805 a-c, referred to as the “cardinality” of the cluster, may vary considerably. If one cluster contains only a few data vectors while another contains 100× more, there may be little benefit to distributing the two clusters' computation across separate processing systems. Rather, it would probably be desirable to more evenly distribute the data vectors among the clusters so that computation is more evenly distributed across the processing systems 815 a-c. Naturally, one will recognize topologies other than the example of FIG. 8 in which various of the disclosed embodiments may apply.
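The following sketch stands in for the topology of FIG. 8, using local worker processes in place of the separate processing systems 815 a-c; the brute-force within-cluster search and the use of concurrent.futures are illustrative assumptions, not the claimed distribution mechanism.

    import numpy as np
    from concurrent.futures import ProcessPoolExecutor

    def nearest_within_cluster(cluster):
        """Brute-force nearest neighbor for every vector in one cluster."""
        results = {}
        for i in range(len(cluster)):
            others = [j for j in range(len(cluster)) if j != i]
            if others:
                results[i] = min(
                    others, key=lambda j: np.linalg.norm(cluster[j] - cluster[i]))
        return results

    def distribute(clusters):
        """The primary system (810) farms one cluster out per worker (815 a-c)."""
        with ProcessPoolExecutor() as pool:
            return list(pool.map(nearest_within_cluster, clusters))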

Cluster Data Vector Redistribution

FIG. 9 is an example redistribution of clustered points via a distance-based reassignment as may be used in some embodiments. In this example, initially in state 905 a a handful of data vectors are associated with the cluster 915 a, which is associated with centroid 910 a (e.g., following a K-means iteration). Conversely, a much larger number of data vectors is associated with the cluster 915 b associated with centroid 910 b. To more evenly distribute computation, data vectors in the cluster 915 b may be re-associated with the cluster 915 a.

Naturally, there are many ways to redistribute the data vectors once the desirability of the redistribution has been identified. FIGS. 9 and 10 provide two possible examples. In FIG. 9 the distance of the data vectors is taken into consideration, and in the state 905 b a new cluster 920 a with centroid 925 a may be created which integrates data vectors from the previous clusters 915 a and 915 b. Though the terms “new” and “previous” may be used herein in regard to clusters, one will recognize that in an actual implementation the same data structures may be reused rather than recreated (e.g., a cluster object's contents may be redefined, or a parameter depicting cluster association in a data vector may be modified).

While consideration of the distance during redistribution as in FIG. 9 may improve accuracy in some instances, it may be too computationally expensive in some cases. Accordingly, in some cases it may be more efficient to redistribute the data vectors randomly following identification. FIG. 10 is an example redistribution of clustered points via a random reassignment as may be used in some embodiments. As in the example of FIG. 9, in state 1005 a two clusters 1015 a and 1015 b have been identified for redistribution. Following redistribution in state 1005 b, the data vectors have been randomly reassigned to each cluster (indicated by the white and black circles). As a consequence, the centroids of the two new clusters 1020 a and 1020 b may be roughly co-located. In some embodiments, the system may seek to ensure that roughly half of the cumulative data vectors are assigned to one cluster and half are assigned to the other. Other variations of the example redistribution methods of FIGS. 9 and 10 will be readily recognized by one skilled in the art.
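A sketch of the random reassignment of FIG. 10 follows; pooling the two clusters and drawing a random half is one illustrative way to honor the roughly-half split mentioned above.

    import numpy as np

    def split_randomly(cluster_a, cluster_b, seed=None):
        """FIG. 10-style redistribution: randomly reassign pooled vectors."""
        rng = np.random.default_rng(seed)
        pooled = np.vstack([cluster_a, cluster_b])
        # Assign roughly half the cumulative data vectors to each cluster.
        first = np.zeros(len(pooled), dtype=bool)
        first[rng.choice(len(pooled), size=len(pooled) // 2, replace=False)] = True
        return pooled[first], pooled[~first]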

Example Process for Clustering Redistribution

FIG. 11 is a flow diagram depicting a nearest neighbor determination as may occur in some embodiments.

After receiving a dataset, at block 1105 the system (e.g., the primary processing system 810) may determine a preliminary clustering of the data (e.g., by applying K-means). At block 1110, the system may determine statistics for the preliminary clustering. For example, as described in greater detail with respect to FIG. 12, the second closest centroid, cardinality, and other features of the clusters may be determined. In some embodiments, these statistics may be made available as part of the preliminary clustering process.

The blocks contained in region 1115 may be part of a “split and remove” operation (though, as discussed above, clusters may simply be redefined rather than new cluster categorizations created and old cluster categorizations removed). In the blocks in region 1115, clusters may be identified as being “too large” and in need of redistribution of their contents (“splitting”). Conversely, clusters can also be identified that may accept the data vectors being removed. In some embodiments, these clusters are identified based on the potential error introduced by their receiving additional data vectors.

Accordingly, at block 1125 the system may identify oversized clusters to split. At block 1120, the system may also identify clusters that may receive additional data vectors while minimizing any consequential error. The identification may be based, at least in part, upon the statistics acquired at block 1110. Once appropriate data clusters have been identified, the system may redistribute their data vectors at block 1130.

The redistributed clusters may then be passed to the distributed processing systems (e.g., 815 a-c) at block 1135. As discussed above, the results may then be received back at the primary processing system at block 1140, e.g., for consolidation, or provided directly to a user.

Cluster Statistics

FIG. 12 is an example plot depicting certain clustering statistics as may be used in some embodiments. For example, certain of these statistics may be collected at block 1110. For a cluster 1225, the cardinality of the cluster (in this example 31) may be determined and stored for later use. In addition, statistics may be collected to determine what error may result if the cluster were used to receive another cluster's data vectors. One metric for assessing this potential error is to determine the distance to the second-closest centroid. In this example, the first closest centroid to the centroid 1215 is centroid 1205 (distance D1=22) and the second closest centroid is centroid 1210 (distance D2=37).

The depicted example of FIG. 12 illustrates the calculation of the first and second closest centroids relative to the centroid 1215 of a cluster. In some embodiments, this single determination will be reused for all the vectors in cluster 1225. However, in other embodiments a more granular approach may be taken, wherein the first and second closest centroids are separately determined for each of the vectors in the cluster 1225. For example, vector 1230 a may be associated with a first and second closest centroid pair which may or may not be the same as that associated with vector 1230 b.

Determination of the second closest centroid may itself involve a nearest neighbors search (e.g., even a distributed search such as the one described herein). Optimizations, e.g., LSH, may be applied, and optimizations to LSH, e.g., as are discussed herein or incorporated by reference, may also be used to improve the identification of the first and/or second closest centroid to a cluster.
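The statistics of FIG. 12 can be gathered with a short sketch such as the following, which records each cluster's cardinality together with the distances D1 and D2 to the closest and second-closest other centroids (a brute-force scan here; LSH could substitute as noted above). It assumes at least three centroids, and the dictionary layout is an illustrative convention reused in later sketches.

    import numpy as np

    def cluster_statistics(centroids, labels):
        """Per-cluster cardinality and first/second closest centroid distances."""
        stats = []
        for j, c in enumerate(centroids):
            dists = sorted(
                np.linalg.norm(other - c)
                for i, other in enumerate(centroids) if i != j)
            stats.append({
                "cardinality": int((labels == j).sum()),
                "d1": dists[0],   # closest other centroid (D1 in FIG. 12)
                "d2": dists[1],   # second-closest centroid (D2 in FIG. 12)
            })
        return stats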

Example Clustering Module

FIG. 13 is a block diagram of a re-clustering module 1305 as may be implemented in some embodiments. The re-clustering module 1305 may comprise firmware, software, or hardware (e.g., an FPGA) configured to perform operations from FIG. 14. The re-clustering module 1305 may receive one or more inputs, which may include clustering statistics, a threshold on the number of “top cardinality clusters” to consider, and a minimum cardinality ratio for determining pair matching (between oversized clusters and clusters able to receive additional data vectors). One will recognize that the inputs may take a variety of different forms (e.g., an absolute value rather than a ratio, a function, etc.). Similarly, the inputs may be provided via a user or via automation.

The clustering statistics may include various data, e.g., the cardinality of a cluster, and the distance to the second-closest centroid of another cluster, e.g., as described in FIG. 12. In embodiments where pairs of first and second closest centroids are identified for each individual vector in a cluster, as discussed in FIG. 12, the sum of distances to the second closest centroid (across the vectors of the cluster) may be included in the statistics rather than a single pairing for the entire cluster. The number of top-cardinality clusters may be used to indicate how many of the largest clusters are to be redistributed. For example, if past experience indicates that 10-12 large clusters result for a given domain, this parameter may be set to 10 or 12. The minimum cardinality ratio minimizes errors resulting from the redistribution by imposing a constraint on which smaller clusters can be selected for redistribution with the data vectors of a larger cluster.

One will recognize that the top-cardinality clusters parameter is complementary in some respects to the minimum cardinality ratio parameter. That is, the former addresses the issue of overly large clusters, while the latter ensures that the data vectors are redistributed in an error-minimizing manner. One will readily recognize other forms these complementary parameters may take. For example, a maximum cardinality value may instead indicate the cardinality size above which all clusters are to be split. Similarly, a minimum error metric based on centroid distance, rather than the cardinality ratio, may be used in lieu of, or in conjunction with, the minimum cardinality ratio. Accordingly, the module 1305 may generally receive two complementary parameters identifying a feature of the maximum cardinality clusters and of the clusters for receiving excess data vectors.

In some embodiments, the module 1305 may perform the cluster redistributions itself. In others, as depicted, the module may output an indication of the redistribution operations to be performed, and another module or system may actually implement the adjustments. Other outputs, e.g., regarding the computational efficiency, may also be provided for feedback determinations.

Example Process for Cluster Selection and Adjustment

FIG. 14 is a flow diagram depicting a re-clustering operation 1400 as may occur in some embodiments. The depicted operation 1400 is merely one example and one will recognize that variations in the depicted processes may have the same or similar effect. At block 1405 the system may sort the clusters by cardinality into a sorted list, e.g., using the cluster statistics. At block 1410 the system may sort the clusters by the second closest centroid distances (either the distance associated with the centroid of the cluster or the accumulated distance for each vector in the cluster) into a second sorted list, again, e.g., using the cluster statistics.

At decision block 1415 the system may determine whether it has addressed the desired number or type of high-cardinality clusters specified via an input parameter. In this instance, the system considers if the maximum number of top-cardinality clusters have been considered (for example, where the top-cardinality input parameter is 15, the first 15 largest clusters of the first list). If so, the process returns.

Alternatively, at block 1420 the system may consider the next highest cardinality cluster, referred to as a “cardinality cluster” (CC), from the first ordered list. As discussed in greater detail below, the CC may be set aside at decision block 1425 if it contains too many duplicate vectors. Whether the set-aside CC is considered as one of the clusters contributing to reaching the top-cardinality threshold is a design choice left to the implementer (in some embodiments the omitted cluster contributes, and in others it does not). At block 1430 the next cluster from the second sorted list, referred to as a “second distance cluster” (SDC), may be considered. If the SDC's cardinality is too great relative to the CC, as based, e.g., on an input parameter (e.g., a cardinality ratio between the two clusters), the SDC may be set aside in favor of another cluster at decision block 1435.

Additional matching algorithms may be considered in some embodiments. For example, the problem may be reformulated as “bipartite matching,” where the left partition contains the high-cardinality clusters and the right partition contains clusters with low summed second-centroid distances. Weights between the left and right values may capture how much the cardinality balance will improve and how much error will consequently be added by an adjustment (for example, via a linear combination of the two). A standard bipartite matching algorithm may then be used to determine which clusters to split and which to remove, as in the sketch below.
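A minimal sketch of that reformulation follows, assuming the per-cluster statistics shown earlier and using the rectangular-assignment solver from SciPy; the cost weights (the alpha trade-off) are illustrative assumptions.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def match_clusters(stats, top_n, alpha=1.0):
        """Bipartite pairing of oversized clusters with low-error receivers."""
        order = np.argsort([-s["cardinality"] for s in stats])
        left = order[:top_n]       # high-cardinality clusters to split
        right = order[top_n:]      # candidate receivers (must outnumber left)
        # Cost trades the cardinality balance gained against the
        # second-centroid-distance error added (a linear combination).
        cost = np.array([[alpha * stats[r]["d2"] - stats[l]["cardinality"]
                          for r in right] for l in left])
        rows, cols = linear_sum_assignment(cost)
        return [(int(left[i]), int(right[j])) for i, j in zip(rows, cols)]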

Alternatively, at block 1440, if the cardinality relation is acceptable, the system may designate the CC for “splitting” and the SDC for “removal.” As discussed above, the same or similar redistribution may be achieved by a variety of methods (see, e.g., FIGS. 9 and 10), and the clusters may not be literally “split” or “removed”; rather, the constituent data vectors are reassigned to achieve the desired effect.
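Putting the pieces together, a greedy sketch of operation 1400 might read as follows; the statistics dictionaries from the earlier sketch, the optional duplicate check, and the interpretation of the minimum cardinality ratio are illustrative assumptions.

    def select_split_remove_pairs(stats, top_n, min_ratio, has_duplicates=None):
        """Greedy sketch of FIG. 14: pair each CC with an acceptable SDC."""
        by_cardinality = sorted(range(len(stats)),
                                key=lambda j: -stats[j]["cardinality"])  # block 1405
        by_d2 = sorted(range(len(stats)),
                       key=lambda j: stats[j]["d2"])                     # block 1410
        pairs, used = [], set()
        for cc in by_cardinality[:top_n]:              # blocks 1415/1420
            if has_duplicates and has_duplicates(cc):  # decision block 1425
                continue
            for sdc in by_d2:
                if sdc == cc or sdc in used:
                    continue
                # Decision block 1435: the CC must be at least min_ratio
                # times the SDC's cardinality, else try another SDC.
                if stats[cc]["cardinality"] < min_ratio * stats[sdc]["cardinality"]:
                    continue
                pairs.append((cc, sdc))                # block 1440: split/remove
                used.add(sdc)
                break
        return pairs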

Redundant Vectors

Various of the embodiments employ further optimizations to the nearest-neighbor determination, either prior to or during redistribution, or once the cluster has been delivered to processing system 815 a for analysis. One optimization employed in some embodiments is the identification and removal of “redundant vectors.”

Despite the high dimensionality of many datasets, in some instances several of the data vectors may have identical, or near identical, values. These redundant vectors may adversely influence the computation of the nearest neighbors in some instances. For example, FIG. 15 is a block diagram depicting redundant vector behavior in LSH as may occur in some embodiments. As discussed with respect to FIG. 6, the data vectors may be iteratively projected and indexed as part of the LSH process. As discussed herein, various embodiments employ an improved LSH process, e.g., as part of the K-means calculation, which may itself be used for nearest-neighbor determination or for cluster redistribution, as discussed in the application incorporated by reference above.

In FIG. 15, the data vectors 1510 and 1515 are relatively dispersed. However, the data vectors 1505 a-f generally share the same or similar values. As a consequence, during LSH each of the data vectors 1505 a-f will be projected onto the same quantized bins 1520 a-b of a plurality of random vectors 610 a-c. Again and again, through many or all iterations of the tables 1530, these redundant vectors may fall within the same bin index. Such repetition may adversely weight the LSH calculation and result in an unbalanced computation. In some embodiments, the existence of redundant vectors is determined, and the redundancy reduced to a single representative vector to mitigate the problem. This improvement may be applied wherever LSH is used as discussed herein, e.g., during K-means calculations, second nearest centroid determination, nearest-neighbor identification, etc.

FIG. 16 is a flow diagram for a process 1600 for identifying and optionally removing redundant vectors as may occur in some embodiments. This is merely one possible example implementation and one will readily recognize variations to achieve the same or similar effect. At decision block 1605, the system may determine whether enough search attempts have been made to identify redundant vectors, e.g., based upon a user input or an automated appraisal of the cluster's character.

If additional searches are to be performed, at block 1610 the system may increment the record of the number of search attempts. At block 1615 the system may randomly select a corpus of data vectors from the data set under consideration (for example, a subset comprising 5% of all the data vectors). Within this corpus, the system may seek to identify redundant vectors at decision block 1620. Where the randomly selected corpus is representative of the data set as a whole, the existence of redundancies in the corpus will reflect additional redundancies in the entire data set. If such redundancies are identified in the corpus (e.g., in excess of a threshold), at block 1625 the system may identify the repeated data vector as a duplicate. Simultaneously, or in conjunction with another system, the redundant vectors may be removed throughout the entire data set, so that, e.g., only a single representative remains. LSH may then be applied as described above to the dataset. In some embodiments, the system may find at most one redundant vector per cluster to improve efficiency.

At block 1630 the system may decide to begin the search anew, for example, by resetting the search attempts counter after a redundant vector has been identified. In this manner, if multiple classes of redundant vectors exist (a first repeating a first set of values, a second repeating a second set of values, etc.) they may each be successively identified and removed before applying LSH.
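For illustration, process 1600 might be sketched as below; the sampling fraction, the tolerance, and the exact-duplicate test are illustrative assumptions, and the counter reset mirrors block 1630.

    import numpy as np

    def find_duplicate(corpus, tol=1e-9):
        """Decision block 1620: return a value repeated in the corpus, or None."""
        for i in range(len(corpus)):
            for j in range(i + 1, len(corpus)):
                if np.linalg.norm(corpus[i] - corpus[j]) <= tol:
                    return corpus[i]
        return None

    def remove_redundant(vectors, sample_frac=0.05, max_attempts=10, seed=None):
        """Sketch of process 1600: sample, detect, and deduplicate."""
        rng = np.random.default_rng(seed)
        attempts = 0
        while attempts < max_attempts:                 # decision block 1605
            attempts += 1                              # block 1610
            # Block 1615: randomly select a corpus (e.g., 5% of the data set).
            size = min(len(vectors), max(2, int(sample_frac * len(vectors))))
            corpus = vectors[rng.choice(len(vectors), size=size, replace=False)]
            duplicate = find_duplicate(corpus)
            if duplicate is not None:
                # Block 1625: drop all copies, then keep one representative.
                dists = np.linalg.norm(vectors - duplicate, axis=1)
                keep = dists > 1e-9
                keep[int(np.argmin(dists))] = True
                vectors = vectors[keep]
                attempts = 0                           # block 1630: search anew
        return vectors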

Computer System

FIG. 17 is a block diagram of a computer system as may be used to implement features of some of the embodiments. The computing system 1700 may include one or more central processing units (“processors”) 1705, memory 1710, input/output devices 1725 (e.g., keyboard and pointing devices, display devices), storage devices 1720 (e.g., disk drives), and network adapters 1730 (e.g., network interfaces) that are connected to an interconnect 1715. The interconnect 1715 is illustrated as an abstraction that represents any one or more separate physical buses, point-to-point connections, or both connected by appropriate bridges, adapters, or controllers. The interconnect 1715, therefore, may include, for example, a system bus, a Peripheral Component Interconnect (PCI) bus or PCI-Express bus, a HyperTransport or industry standard architecture (ISA) bus, a small computer system interface (SCSI) bus, a universal serial bus (USB), IIC (I2C) bus, or an Institute of Electrical and Electronics Engineers (IEEE) standard 1394 bus, also called “Firewire.”

The memory 1710 and storage devices 1720 are computer-readable storage media that may store instructions that implement at least portions of the various embodiments. In addition, the data structures and message structures may be stored or transmitted via a data transmission medium, e.g., a signal on a communications link. Various communications links may be used, e.g., the Internet, a local area network, a wide area network, or a point-to-point dial-up connection. Thus, computer-readable media can include computer-readable storage media (e.g., “non-transitory” media) and computer-readable transmission media.

The instructions stored in memory 1710 can be implemented as software and/or firmware to program the processor(s) 1705 to carry out actions described above. In some embodiments, such software or firmware may be initially provided to the computing system 1700 by downloading it from a remote system (e.g., via the network adapter 1730).

The various embodiments introduced herein can be implemented by, for example, programmable circuitry (e.g., one or more microprocessors) programmed with software and/or firmware, or entirely in special-purpose hardwired (non-programmable) circuitry, or in a combination of such forms. Special-purpose hardwired circuitry may be in the form of, for example, one or more ASICs, PLDs, FPGAs, etc.

Remarks

The above description and drawings are illustrative and are not to be construed as limiting. Numerous specific details are described to provide a thorough understanding of the disclosure. However, in certain instances, well-known details are not described in order to avoid obscuring the description. Further, various modifications may be made without deviating from the scope of the embodiments. Accordingly, the embodiments are not limited except as by the appended claims.

Reference in this specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the disclosure. The appearances of the phrase “in one embodiment” in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Moreover, various features are described which may be exhibited by some embodiments and not by others. Similarly, various requirements are described which may be requirements for some embodiments but not for other embodiments.

The terms used in this specification generally have their ordinary meanings in the art, within the context of the disclosure, and in the specific context where each term is used. Certain terms that are used to describe the disclosure are discussed below, or elsewhere in the specification, to provide additional guidance to the practitioner regarding the description of the disclosure. For convenience, certain terms may be highlighted, for example using italics and/or quotation marks. The use of highlighting has no influence on the scope and meaning of a term; the scope and meaning of a term is the same, in the same context, whether or not it is highlighted. It will be appreciated that the same thing can be said in more than one way. One will recognize that “memory” is one form of “storage” and that the terms may on occasion be used interchangeably.

Consequently, alternative language and synonyms may be used for any one or more of the terms discussed herein, and no special significance is to be placed upon whether or not a term is elaborated or discussed herein. Synonyms for certain terms are provided. A recital of one or more synonyms does not exclude the use of other synonyms. The use of examples anywhere in this specification, including examples of any term discussed herein, is illustrative only and is not intended to further limit the scope and meaning of the disclosure or of any exemplified term. Likewise, the disclosure is not limited to the various embodiments given in this specification.

Without intent to further limit the scope of the disclosure, examples of instruments, apparatus, methods, and their related results according to the embodiments of the present disclosure are given above. Note that titles or subtitles may be used in the examples for the convenience of a reader, which in no way should limit the scope of the disclosure. Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure pertains. In the case of conflict, the present document, including definitions, will control.

What is claimed is:
1. A computer-implemented method for estimating a nearest neighbor to a target data vector in a dataset, comprising: receiving, by a computer system, a request to determine the nearest neighbor to the target data vector; determining, by the computer system, a preliminary clustering of the dataset, the preliminary clustering comprising multiple preliminary clusters, the determining including: for an initial set of clusters of the dataset, computing a centroid for each cluster in the initial set of clusters, hashing, by the computer system, the centroids into a set of buckets, identifying a subset of the centroids that is hashed into the set of buckets into which a specified data vector from the dataset is hashed, determining a centroid from the subset of the centroids that is nearest to the specified data vector, and classifying the specified data vector into a first preliminary cluster, of the preliminary clusters, that corresponds to the centroid; redistributing, by the computer system, data vectors between at least two of the preliminary clusters to produce at least two redistributed clusters, the target data vector included in a first of the at least two redistributed clusters; transmitting, by the computer system, the first of the at least two redistributed clusters to a first computer system and a second of the at least two redistributed clusters to a second computer system; estimating, by the first computer system, a nearest neighbor of the target data vector within the first of the two redistributed clusters, wherein the nearest neighbor is one of multiple data vectors in the dataset that is most similar to the target data vector; and outputting, by the first computer system and in response to the request, information of the nearest neighbor to a user.
2. The computer-implemented method of claim 1, wherein determining a preliminary clustering comprises performing an iteration of a K-means routine.

3. The computer-implemented method of claim 2, wherein the K-means routine applies a locality sensitive hashing (LSH) function for determination of a nearest centroid.

4. The computer-implemented method of claim 3, wherein the LSH is parameterized by a number of projections m, a quantization factor w and a number of repetitions L, and hashes a query vector to a collection of buckets by: computing a first cost of hashing the query vector into a collection of buckets based on m; computing a second cost of searching the data vectors in the union of the buckets based on L; and determining a sum of the first cost and the second cost.

5. The computer-implemented method of claim 4, wherein: the first cost is proportional to a product of m and L; and the second cost is proportional to a product of L and an expected number of the data vectors that are hashed to a bucket to which the query vector is hashed with L equal to one.

6. The computer-implemented method of claim 1, wherein estimating a nearest neighbor comprises applying a LSH function.

7. The computer-implemented method of claim 1, further comprising identifying and compensating for at least one redundant vector prior to estimating the nearest neighbor of the target data vector.

8. The computer-implemented method of claim 7, wherein identifying at least one redundant vector comprises randomly selecting a corpus of data vectors from the data set and identifying at least two data vectors within the corpus sharing the same or similar values.

9. The computer-implemented method of claim 1, wherein redistributing data vectors comprises determining a cardinality and a second-closest centroid distance for a plurality of the preliminary clusters.

10. The computer-implemented method of claim 9, wherein redistributing data vectors comprises: determining a first preliminary cluster based upon a cardinality of the first preliminary cluster; determining a second preliminary cluster based upon a distance to a second-closest centroid of the second preliminary cluster; verifying that the cardinality of the first cluster and the cardinality of the second cluster correspond to a relation; and redistributing data vectors between the first and second preliminary clusters such that the cardinality of at least one of the clusters is reduced.

11. The computer-implemented method of claim 10, wherein verifying that the cardinality of the first preliminary cluster and the cardinality of the second preliminary cluster correspond to a relation comprises verifying that a ratio of the cardinalities of the first preliminary cluster and the second preliminary cluster does not exceed a threshold.
12. A computer-readable storage device comprising instructions for estimating a nearest neighbor to a target data vector in a dataset, the instructions comprising: instructions for receiving, by the computer system, a request to determine the nearest neighbor to the target data vector; instructions for determining, by the computer system, a preliminary clustering of the dataset, the preliminary clustering comprising a plurality of preliminary clusters, the determining including: for an initial set of clusters of the dataset, computing a centroid for each cluster in the initial set of clusters, hashing, by the computer system, the centroids into a set of buckets, identifying a subset of the centroids that is hashed into the set of buckets into which a specified data vector from the dataset is hashed, determining a centroid from the subset of the centroids that is nearest to the specified data vector, and classifying the specified data vector into a first preliminary cluster, of the preliminary clusters, that corresponds to the centroid; instructions for redistributing data vectors between at least two of the preliminary clusters to produce at least two redistributed clusters, the target data vector included in a first of the at least two redistributed clusters; instructions for transmitting, by the computer system, the first of the at least two redistributed clusters to a first computer system and a second of the at least two redistributed clusters to a second computer system; instructions for estimating, by the first computer system, a nearest neighbor of the target data vector within the first of the two redistributed clusters, wherein the nearest neighbor indicates one of multiple data vectors in the dataset that is most similar to the target data vector; and instructions for outputting, by the first computer system and in response to the request, information of the nearest neighbor to a user.
13. The computer-readable storage device of claim 12, wherein determining a preliminary clustering comprises performing an iteration of the K-means routine.

14. The computer-readable storage device of claim 13, wherein the K-means routine applies a LSH function for determination of a nearest centroid.

15. The computer-readable storage device of claim 14, wherein the LSH is parameterized by a number of projections m, a quantization factor w and a number of repetitions L and hashes a query vector to a collection of buckets, the method comprising: computing a first cost of hashing the query vector into a collection of buckets based on m; computing a second cost of searching the data vectors in the union of the buckets based on L; and determining a sum of the first cost and the second cost.

16. The computer-readable storage device of claim 15, wherein: the first cost is proportional to a product of m and L; and the second cost is proportional to a product of L and an expected number of the data vectors that are hashed to a bucket to which the query vector is hashed with L equal to one.

17. The computer-readable storage device of claim 12, wherein estimating a nearest neighbor comprises applying a LSH function.

18. The computer-readable storage device of claim 12, further comprising instructions for identifying and compensating for at least one redundant vector prior to estimating the nearest neighbor of the target data vector.

19. The computer-readable storage device of claim 18, wherein identifying at least one redundant vector comprises randomly selecting a corpus of data vectors from the data set and identifying at least two data vectors within the corpus sharing the same or similar values.

20. The computer-readable storage device of claim 12, wherein redistributing data vectors comprises determining a cardinality and a second-closest centroid distance for a plurality of the preliminary clusters.

21. The computer-readable storage device of claim 20, wherein redistributing data vectors comprises: determining a first preliminary cluster based upon a cardinality of the first preliminary cluster; determining a second preliminary cluster based upon a distance to a second-closest centroid of the second preliminary cluster; verifying that the cardinality of the first cluster and the cardinality of the second cluster correspond to a relation; and redistributing data vectors between the first and second preliminary clusters such that the cardinality of at least one of the clusters is reduced.

22. The computer-readable storage device of claim 21, wherein verifying that the cardinality of the first preliminary cluster and the cardinality of the second preliminary cluster correspond to a relation comprises verifying that a ratio of the cardinalities of the first preliminary cluster and the second preliminary cluster does not exceed a threshold.
23. A system for estimating a nearest neighbor to a target data vector in a dataset, comprising: a component configured to receive a request to determine the nearest neighbor to the target data vector; a component configured to determine a preliminary clustering of the dataset, the preliminary clustering comprising a plurality of preliminary clusters, the component further configured to determine the preliminary clustering by: for an initial set of clusters of the dataset, computing a centroid for each cluster in the initial set of clusters, hashing, by the computer system, the centroids into a set of buckets, identifying a subset of the centroids that is hashed into the set of buckets into which a specified data vector from the dataset is hashed, determining a centroid from the subset of the centroids that is nearest to the specified data vector, and classifying the specified data vector into a first preliminary cluster, of the preliminary clusters, that corresponds to the centroid; a component configured to redistribute data vectors between at least two of the preliminary clusters to produce two redistributed clusters, the target data vector included in a first of the two redistributed clusters; a component configured to transmit the first of the at least two redistributed clusters to a first computer system and a second of the at least two redistributed clusters to a second computer system; a component configured to estimate a nearest neighbor of the target data vector within the first of the two redistributed clusters, wherein the nearest neighbor is one of multiple data vectors in the dataset that is most similar to the target data vector; and a component configured to output, by the first computer system and in response to the request, information of the nearest neighbor to a user.
24. The system of claim 23, wherein the preliminary clustering is determined by performing an iteration of the K-means routine.

25. The system of claim 24, wherein the K-means routine applies a LSH function for determination of a nearest centroid.

26. The system of claim 25, wherein the LSH is parameterized by a number of projections m, a quantization factor w and a number of repetitions L and hashes a query vector to a collection of buckets, by: computing a first cost of hashing the query vector into a collection of buckets based on m; computing a second cost of searching the data vectors in the union of the buckets based on L; and determining a sum of the first cost and the second cost.

27. The system of claim 26, wherein: the first cost is proportional to a product of m and L; and the second cost is proportional to a product of L and an expected number of the data vectors that are hashed to a bucket to which the query vector is hashed with L equal to one.

28. The system of claim 23, wherein the nearest neighbor is estimated by applying a LSH function.

29. The system of claim 23, further comprising a component configured to identify and compensate for at least one redundant vector prior to estimating the nearest neighbor of the target data vector.

30. The system of claim 29, wherein the at least one redundant vector is identified by randomly selecting a corpus of data vectors from the data set and identifying at least two data vectors within the corpus sharing the same or similar values.

31. The system of claim 23, wherein the data vectors are redistributed by determining a cardinality and a second-closest centroid distance for a plurality of the preliminary clusters.

32. The system of claim 31, wherein the data vectors are redistributed by: determining a first preliminary cluster based upon a cardinality of the first preliminary cluster; determining a second preliminary cluster based upon a distance to a second-closest centroid of the second preliminary cluster; verifying that the cardinality of the first cluster and the cardinality of the second cluster correspond to a relation; and redistributing data vectors between the first and second preliminary clusters such that the cardinality of at least one of the clusters is reduced.

33. The system of claim 32, wherein it is verified that the cardinality of the first preliminary cluster and the cardinality of the second preliminary cluster correspond to a relation by verifying that a ratio of the cardinalities of the first preliminary cluster and the second preliminary cluster does not exceed a threshold.